ABSTRACT



An approach for automatically determining the accuracy of 
a pronunciation dictionary in a speech recognition system 
involves comparing an expected pronunciation representa- 
tion for a particular word from a pronunciation dictionary to 
one or more actual pronunciations of the particular word. An 
accuracy score for each of the phonemes that constitute the 
pronunciation of the particular word is determined from the 
comparison of the expected and actual pronunciations for 
the particular word. The accuracy score is evaluated against 
specified accuracy criteria to determine whether the 
expected pronunciation for the particular word satisfies the 
specified accuracy criteria. If the expected pronunciation 
does not satisfy the specified accuracy criteria for the 
particular word, then the expected pronunciation for the 
particular word in the pronunciation dictionary is identified 
as requiring updating. Manual or automated update mecha- 
nisms may then be employed to update the identified 
expected pronunciation representations to reflect the actual 
pronunciations. 

12 Claims, 6 Drawing Sheets

AUTOMATICALLY DETERMINING WORDS
FOR UPDATING IN A PRONUNCIATION
DICTIONARY IN A SPEECH RECOGNITION
SYSTEM

FIELD OF THE INVENTION 

The invention relates generally to speech recognition 
systems, and relates more specifically to an approach for 
evaluating the accuracy of a pronunciation dictionary in a 
speech recognition system. 

BACKGROUND OF THE INVENTION 

Most speech recognition systems use a pronunciation 
dictionary to identify particular words contained in received 
utterances. The term "utterance" is used herein to refer to 
one or more sounds generated either by humans or by 
machines. Examples of an utterance include, but are not 
limited to, a single sound, any two or more sounds, a single 
word or two or more words. In general, a pronunciation 
dictionary contains data that defines expected pronuncia- 
tions of utterances. When an utterance is received, the 
received utterance, or at least a portion of the received 
utterance, is compared to the expected pronunciations con- 
tained in the pronunciation dictionary. An utterance is rec- 
ognized when the received utterance, or portion thereof, 
matches the expected pronunciation contained in the pro- 
nunciation dictionary. 

One of the most important concerns with pronunciation 
dictionaries is to ensure that expected pronunciations of 
utterances defined by the pronunciation dictionary accu- 
rately reflect actual pronunciations of the utterances. If an 
actual pronunciation of a particular utterance does not match 
the expected pronunciation, the expected pronunciation of 
the particular utterance may no longer be useful for identi- 
fying the actual pronunciation of the particular utterance. 

Actual pronunciations of utterances can be misrepre- 
sented for a variety of reasons. For example, in fluent 
speech, some sounds may be systematically deleted or 
adjusted. An application may be installed across diverse 
geographic areas where users have different regional 
accents. Expected pronunciations tend to be somewhat user- 
dependent. Consequently, a change in the users of a par- 
ticular application can adversely affect the accuracy of a 
speech recognition system. This is attributable to different 
speech characteristics of users, such as different intonations 
and stresses in pronunciation. 

Conventionally, pronunciation dictionaries are updated 
manually to reflect changes in actual pronunciations of 
utterances in response to reported problems. When a change 
in an application or user prevents a speech recognition 
system from recognizing utterances, the problem is reported 
to the administrator of the speech recognition system. The 
administrator then identifies the problem utterances and 
manually updates the pronunciation dictionary to reflect the 
changes to the application or users. 

Manually updating a pronunciation dictionary to reflect 
changes to an application or users has several significant 
drawbacks. First, it relies upon problems being reported to 
the administrator of the speech recognition system. Prob- 
lems may exist for long periods of time before being 
reported. In some situations this can adversely affect the 
reputation of the enterprise using the speech recognition 
system. 

Furthermore, even after the problems are identified, a
significant amount of human resources may be required
to update the pronunciation dictionary, further extending the
problem. For example, updating the pronunciation dictionary
typically involves collecting a large amount of actual
pronunciation data for the problem utterances. The actual
pronunciation data is then processed and used to update the
expected pronunciation data contained in the pronunciation
dictionary. Meanwhile, the speech recognition system is
unable to recognize the problem utterances until the system
is updated, which can be very frustrating to customers and
other users of the system.

Based on the foregoing, there is a need for an automated 
approach for determining the accuracy of a pronunciation 
dictionary in a speech recognition system. 

There is a particular need for an automated approach for
determining the accuracy of a pronunciation dictionary in a
speech recognition system that identifies particular expected
pronunciation representations that do not satisfy specified
accuracy criteria and therefore need to be updated.

There is a further particular need for an automated
approach for determining the accuracy of a pronunciation
dictionary in a speech recognition system that requires a
reduced amount of human resources in the identification
process.

SUMMARY OF THE INVENTION 

The foregoing needs, and other needs and objects that will
become apparent from the following description, are
achieved by the present invention, which comprises, in one
aspect, a method for determining the accuracy of a
pronunciation dictionary in a speech recognition system. According
to the method, an expected pronunciation representation for
a particular utterance is retrieved from the pronunciation
dictionary. Then, an accuracy score is generated for the
expected pronunciation representation by comparing the
expected pronunciation representation to a set of one or
more actual pronunciations of the particular utterance.

According to another aspect, a method is provided for
automatically updating a pronunciation dictionary in a
speech recognition system to reflect one or more changes to
an actual pronunciation of a particular word that is
represented in the pronunciation dictionary. According to the
method, an expected pronunciation representation for the
particular word is retrieved from the pronunciation
dictionary. An accuracy score is generated for the expected
pronunciation representation by comparing the expected
pronunciation representation to one or more actual
pronunciations of the particular word. A determination is made
whether the accuracy score for the expected pronunciation
representation satisfies specified accuracy criteria. If the
accuracy score for the expected pronunciation representation
does not satisfy the specified accuracy criteria, then the
expected pronunciation representation is updated to reflect
the one or more actual pronunciations.

According to another aspect, a speech recognition
apparatus is provided. The speech recognition apparatus
comprises a storage medium having a pronunciation dictionary
stored thereon and a diagnostic mechanism communicatively
coupled to the storage medium. The diagnostic
mechanism is configured to retrieve an expected pronunciation
representation for a particular utterance from the
pronunciation dictionary. The diagnostic mechanism is further
configured to generate an accuracy score for the expected
pronunciation representation by comparing the expected
pronunciation representation to a set of one or more actual
pronunciations of the particular utterance.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments are illustrated by way of example, and not
by way of limitation, in the figures of the accompanying
drawings in which like reference numerals refer to similar
elements and in which:

FIG. 1 is a block diagram of a system for automatically
determining the accuracy of a pronunciation dictionary in a
speech recognition system according to an embodiment.

FIG. 2A is a block diagram illustrating a portion of the
contents of a pronunciation dictionary according to an
embodiment.

FIG. 2B is a block diagram illustrating the contents of a
phoneme string configuration according to an embodiment.

FIG. 3A is a block diagram illustrating comparing a
phoneme string representation of an expected pronunciation
of a word to a first actual pronunciation of a word according
to an embodiment.

FIG. 3B is a block diagram illustrating comparing a
phoneme string representation of an expected pronunciation
of a word to a second actual pronunciation of a word
according to an embodiment.

FIG. 3C is a block diagram illustrating comparing a
phoneme string representation of an expected pronunciation
of a word to a third actual pronunciation of a word according
to an embodiment.

FIG. 4 is a table illustrating determining an accuracy score
for phoneme strings according to an embodiment.

FIG. 5 is a flow diagram of a process for automatically
determining the accuracy of a pronunciation dictionary
according to an embodiment.

FIG. 6 is a block diagram of a computer system on which
embodiments may be implemented.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

In the following description, for the purposes of
explanation, specific details are set forth in order to provide
a thorough understanding of the invention. However, it will
be apparent that the invention may be practiced without
these specific details. In some instances, well-known
structures and devices are depicted in block diagram form in
order to avoid unnecessarily obscuring the invention.

Various aspects and features of example embodiments are
described in more detail in the following sections: (1)
introduction; (2) system overview; (3) pronunciation
representation; (4) determining the accuracy of a pronunciation
dictionary; and (5) implementation mechanisms.

1. Introduction

An approach for automatically determining the accuracy
of a pronunciation dictionary in a speech recognition system
is described. In general, an expected pronunciation
representation for a particular utterance from a pronunciation
dictionary is compared to actual pronunciations of the
particular utterance. An accuracy score for the particular
utterance is determined from the comparison of the expected
and actual pronunciations of the particular utterance. The
accuracy score is evaluated against specified accuracy
criteria to determine whether the expected pronunciation for
the particular utterance satisfies the specified accuracy
criteria. If the expected pronunciation does not satisfy the
specified accuracy criteria for the particular utterance, then
the expected pronunciation for the particular utterance in the
pronunciation dictionary is identified as requiring updating.
Manual or automated update mechanisms may then be
employed to update the identified expected pronunciation
representations to reflect the actual pronunciations.

2. System Overview

FIG. 1 illustrates a system 100 used herein to describe
various aspects and features of the invention. System 100
includes an application 102 that interacts with a speech
recognition system (SRS) 104. Application 102 is any
element that uses the speech recognition services of SRS 104.
Examples of application 102 include, but are not limited to,
a voice-activated system or a telephone-based service
implemented in the form of one or more computer programs or
processes. Application 102 is communicatively coupled to
SRS 104 by a link 106.

SRS 104 includes a recognizer 108, a non-volatile storage
110, containing a pronunciation dictionary 112 and a
pronunciation diagnostic tool 114. Recognizer 108 is
communicatively coupled to non-volatile storage 110 by a link 116.
Diagnostic tool 114 is communicatively coupled to
non-volatile storage 110 by a link 118. Links 116, 118 may be
implemented using any mechanism to provide for the
exchange of data between their respective connected
entities. Examples of links 116, 118 include, but are not limited
to, network connections, wires, fiber-optic links and wireless
communications links. Non-volatile storage 110 may be, for
example, one or more disks.

Recognizer 108 is a mechanism that is configured to
recognize received utterances using pronunciation
dictionary 112. Recognizer 108 may also require interaction with
other components in SRS 104 that are not illustrated or
described herein so as to avoid obscuring the various
features and aspects of the invention.

Pronunciation dictionary 112 contains data that defines
expected pronunciations for utterances that can be
recognized by SRS 104. Pronunciation dictionary 112 is described
in more detail in this document.

According to an embodiment, pronunciation diagnostic
tool 114 is configured to automatically determine the
accuracy of pronunciation dictionary 112 and identify particular
expected pronunciations that do not satisfy specified
accuracy criteria. The expected pronunciations that do not satisfy
the specified accuracy criteria may then be updated to more
accurately reflect the actual pronunciations of received
utterances.

SRS 104 may include other components not illustrated
and described herein to avoid obscuring the various aspects
and features of the invention. For example, SRS 104 may
include various software development tools and application
testing tools available to aid in the development process.
One such tool is a commercially-available package of
reusable speech software modules known as DialogModules™,
provided by Speechworks International, Inc. of Boston,
Mass.

3. Pronunciation Representation

FIG. 2A is a block diagram 200 that illustrates an example
implementation of pronunciation dictionary 112. Other
implementations of pronunciation dictionary 112 may be
used and the invention is not limited to any particular
implementation of pronunciation dictionary 112.

For purposes of explanation, various embodiments are
described herein in the context of recognizing words.
However, embodiments of the invention are applicable to
any type of utterance. In the present example, pronunciation
dictionary 112 contains one or more entries 202, each of
which corresponds to a particular expected pronunciation for
a particular word. Each entry 202 includes a word identifier
value and expected pronunciation representation data.

A word identifier value is any data that specifies a
particular word with which an entry 202 is associated. For
example, a word identifier may be the actual word with
which a particular entry 202 is associated, such as
"CAROUSEL," "APPLE" or "ZOO." As another example,
a word identifier value may be data other than the word
itself, such as WORD1 or WORD2, that allows an entry 202
to be mapped to a particular word. The invention is not
limited to any particular implementation of word identifier
values.

Expected pronunciation representation data is any data
that specifies an expected pronunciation for the word
associated with the entry that contains the expected
pronunciation representation data. According to one embodiment,
expected pronunciation representation data specifies one or
more phonemes, also referred to herein as a "phoneme
string." As used herein, the term "phoneme" refers to the
smallest distinguishable sound in a dialect of a language.
For example, entry 204 is associated with word identifier
value WORD1 and contains expected pronunciation
representation data DATA1 that defines an expected
pronunciation for WORD1. FIG. 2B is a block diagram that illustrates
an example phoneme string 208 for DATA1 according to an
embodiment. Phoneme string 208 includes N number of
phonemes, identified as P1, P2, P3 through PN. Phoneme
string 208 defines an expected pronunciation for WORD1.
Phoneme string 208 may contain any number of phonemes
and the invention is not limited to phoneme strings of any
particular length.

As illustrated in FIG. 2A, some words in pronunciation
dictionary 112, such as WORD1 and WORD4, have only a
single entry 202 and therefore only a single expected
pronunciation. Other words have multiple expected
pronunciations. For example, WORD2 has three entries 202 and
therefore three expected pronunciations. WORD3 has two
expected pronunciations and WORD5 has four expected
pronunciations. Thus, pronunciation dictionary 112 may
specify any number of pronunciations for any number of
words and the invention is not limited to pronunciation
dictionaries having any number of words or any number of
expected pronunciations for a particular word.

4. Determining the Accuracy of a Pronunciation Dictionary

According to one embodiment, the accuracy of
pronunciation dictionary 112 is automatically determined by
comparing a first set of phoneme strings contained in
pronunciation dictionary 112, which represent expected
pronunciations of words, to actual pronunciations of the
words. Phoneme strings contained in the pronunciation
dictionary are scored for accuracy based upon the
comparison to the actual pronunciations. The accuracy scores are
evaluated against specified accuracy criteria to identify
phoneme strings contained in the pronunciation dictionary
that need to be updated to more accurately reflect actual
pronunciations.

A. COMPARING EXPECTED AND ACTUAL
PRONUNCIATIONS USING PHONEME STRINGS

FIG. 3A, FIG. 3B, and FIG. 3C are block diagrams 300,
310, 320, respectively, that illustrate an approach for
automatically determining the accuracy of an expected
pronunciation representation from a pronunciation dictionary
according to an embodiment. Phoneme string 302 represents
an expected pronunciation of a particular word and includes
phonemes P1, P2, P3, P4 through PN.

According to an embodiment, phoneme string 302 is
compared to a first actual pronunciation of the particular
word on a phoneme-by-phoneme basis to determine how
well the expected pronunciation of the particular word
estimates the first actual pronunciation of the
particular word. The first actual pronunciation of the
particular word is projected onto phoneme string 302 and a set
of scores S1, S2, S3, S4 . . . SN, represented by reference
numeral 304, are determined. Each score indicates a
correlation between a particular phoneme and the first actual
pronunciation. For example, in FIG. 3A, score S1 is
indicative of the correlation between phoneme P1 and the first
actual pronunciation. A high score typically indicates a
higher correlation than a relatively lower score. For
example, a score of (0.90) may indicate a relatively higher
correlation between a particular phoneme and an actual
pronunciation than a score of (0.30). In the example in FIG.
3A, the first actual pronunciation of the particular word very
closely matches the expected pronunciation of the particular
word, represented by phoneme string 302. The close match
is reflected in a relatively high set of scores S1, S2, S3,
S4 . . . SN. Thus, it is likely that phoneme string 302 will be
useful to recognizer 108 (FIG. 1) for recognizing the first
actual pronunciation of the particular word.

In the block diagram 310 of FIG. 3B, the expected
pronunciation is evaluated against a second actual
pronunciation of the particular word. In this example, the expected
pronunciation has a high correlation to the second actual
pronunciation except with respect to phoneme P3, as
indicated by score S3. Thus, score S3 is a relatively lower score
than, for example, score S1. The relatively lower score for
score S3 compared to score S1 indicates that phoneme P3
was not as strongly represented as phoneme P1 in the second
actual pronunciation of the particular word. Nevertheless,
since the expected pronunciation scored well with respect to
most of the phonemes, it is likely that phoneme string 302
will be useful to recognizer 108 (FIG. 1) for recognizing the
second actual pronunciation of the particular word.

In the block diagram 320 of FIG. 3C, the expected
pronunciation is evaluated against a third actual
pronunciation of the particular word. In this example, it is assumed
that the expected pronunciation, represented by phoneme
string 302, does not score well with respect to the third
actual pronunciation of the particular word. That is,
there is a relatively low correlation between the phonemes
contained in phoneme string 302 and the third actual
pronunciation of the particular word. The consequence of the
significant differences between the expected pronunciation
and the third actual pronunciation is that phoneme string 302
is unlikely to be useful to recognizer 108 (FIG. 1) for
recognizing the third actual pronunciation of the particular
word.

B. SCORING PHONEME STRINGS

Once phoneme strings from a pronunciation dictionary
have been compared to actual pronunciations of words, the
phoneme strings are scored for accuracy. According to one
embodiment, the accuracy of a particular phoneme string
with respect to a particular actual pronunciation is based
upon the scores for each phoneme contained in the expected
phoneme string. For example, in FIG. 3A, expected
phoneme string 302 might receive a score of (1.00) to indicate
that the first actual pronunciation very closely matched the
expected pronunciation. In FIG. 3B, the second actual
pronunciation did not match the expected pronunciation as
well. Accordingly, expected phoneme string 302 would
receive a relatively lower score with respect to the second
actual pronunciation, for example (0.80) or (0.90). In FIG.
3C, the third actual pronunciation very poorly matched the
expected pronunciation. Accordingly, expected pronunciation
string 302 would receive a relatively low score with
respect to the third actual pronunciation, for example, (0.10)
or (0.20).

Once a particular phoneme string has been scored for one
or more actual pronunciations, the scores are evaluated
against specified accuracy criteria to determine whether the
particular phoneme string needs to be updated to more
accurately reflect actual pronunciations of the associated
word. According to one embodiment, the specified accuracy
criteria include a minimum average score threshold that
corresponds to a minimum average correlation between
phonemes and one or more actual pronunciations. If the
average score for a particular phoneme with respect to one
or more actual pronunciations is less than the minimum
average score threshold, then the phoneme string associated
with the particular phoneme does not satisfy the specified
accuracy criteria and needs to be updated to more accurately
reflect actual pronunciations of the corresponding word.
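
The minimum average score test can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, the sample scores and the default threshold of 0.50 are assumptions chosen for the example, not part of the described system.

```python
from statistics import mean

def average_score_ok(scores, min_average=0.50):
    """True when a phoneme's mean score meets the minimum average threshold."""
    return mean(scores) >= min_average

def string_needs_update(scores_by_phoneme, min_average=0.50):
    """A phoneme string is flagged as soon as any one phoneme fails the test."""
    return any(not average_score_ok(scores, min_average)
               for scores in scores_by_phoneme.values())

# Hypothetical per-phoneme scores against three actual pronunciations.
example = {"P1": [0.90, 0.80, 0.70], "P2": [0.40, 0.50, 0.45]}
print(string_needs_update(example))
```

Here the hypothetical phoneme P2 averages 0.45, so the string containing it would be identified as requiring an update.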

According to another embodiment, the specified accuracy
criteria include a minimum score threshold that specifies
the minimum acceptable score (correlation) for a particular
phoneme with respect to any actual pronunciations. If the
score for the particular phoneme with respect to any actual
pronunciation is less than the minimum score threshold, then
the phoneme string associated with the particular phoneme
does not satisfy the specified accuracy criteria and needs to
be updated to more accurately reflect actual pronunciations
of the corresponding word. The minimum score threshold
may also require that a specified number or fraction of scores
meet or exceed the minimum score threshold for the
accuracy criteria to be satisfied.
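
A minimal sketch of the minimum score test, including the variant in which only a specified fraction of scores must meet the threshold, might look like the following. The function name, the `required_fraction` parameter and the sample scores are hypothetical and serve only to illustrate the idea.

```python
def meets_min_score(scores, min_score=0.30, required_fraction=1.0):
    """True when enough individual scores meet or exceed min_score.

    required_fraction=1.0 is the strict form: every score must pass.
    A smaller value tolerates some low scores (an assumed parameterization).
    """
    if not scores:
        raise ValueError("at least one score is required")
    passing = sum(1 for s in scores if s >= min_score)
    return passing >= required_fraction * len(scores)

# Strict form: one low score is enough to fail.
print(meets_min_score([0.90, 0.20, 0.80]))
# Relaxed form: at least half of the scores must pass.
print(meets_min_score([0.90, 0.20, 0.80, 0.10], required_fraction=0.5))
```

Expressing the tolerance as a fraction keeps the criterion independent of how many actual pronunciations were collected for a word.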

The scoring of phoneme strings is now described in more
detail with reference to a table 400 of FIG. 4. In general,
table 400 contains the results of evaluating a particular
phoneme string for a particular word, consisting of
phonemes P1, P2, P3, P4 and P5, against three actual
pronunciations of the particular word. Table 400 includes five
entries 402, 404, 406, 408, 410 that correspond to the testing
of the five phonemes P1, P2, P3, P4 and P5, respectively,
against three actual pronunciations.

Each phoneme is evaluated against three actual
pronunciations, using the approach previously described
herein and the results are reported in columns 412. The
average score for each phoneme against all three actual
pronunciations is reported in column 414 and is calculated
from the scores in columns 412. An example minimum
average score threshold of (0.50) and an example minimum
score threshold of (0.30) are reported in columns 416, 418,
respectively, for each phoneme. The number of scores for
each phoneme that fall below the minimum score threshold
is reported in column 420.

As illustrated by entries 402, 406, corresponding to the
first and third phonemes, respectively, the average scores for
the first and third phonemes of (0.79) and (0.93),
respectively, satisfy the minimum average score threshold of
(0.50). Furthermore, none of the scores for the first or third
phonemes are below the minimum score threshold of (0.30).
Accordingly, both the first and third phonemes satisfy the
specified accuracy criteria.

As illustrated by entry 404, corresponding to the second
phoneme P2, the average score of (0.47) for the second
phoneme P2 falls below the minimum average score
threshold of (0.50). Therefore, the second phoneme does not
satisfy the specified accuracy criteria.

As illustrated by entries 408, 410, corresponding to the
fourth and fifth phonemes, respectively, the average scores
for both the fourth and fifth phonemes of (0.68) and (0.61),
respectively, satisfy the minimum average score threshold of
(0.50). However, the fourth and fifth phonemes have one and
two scores, respectively, that fall below the minimum score
threshold. Accordingly, the fourth and fifth phonemes cause
the particular phoneme string to not satisfy the specified
accuracy criteria. This example illustrates different
techniques that may be applied to evaluate the scores for
phonemes in a phoneme string. It is understood that the
score for a single phoneme may cause the associated
phoneme string to be updated. For example, since the average
score of (0.47) for phoneme P2 falls below the average score
threshold of (0.50), the associated phoneme string needs to
be updated and the scores for the other phonemes do not
have to be evaluated.

This example assumes that under the specified accuracy 
criteria, a single score below the minimum score threshold 
will not satisfy the specified accuracy criteria. In other 
circumstances, the specified accuracy criteria may specify 
that a particular number of scores below the minimum score 
threshold causes a phoneme string to fail the specified 
accuracy criteria. For example, the specified accuracy cri- 
teria may specify that only a phoneme string that has two or 
more phoneme scores below the minimum score threshold 
does not satisfy the specified accuracy criteria. Under these 
circumstances, the scores for the fourth phoneme would not 
cause the particular phoneme string to not satisfy the speci- 
fied accuracy criteria, but the scores for the fifth phoneme 
would. 
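
The effect of tolerating a given number of low scores can be sketched as below. The individual scores are hypothetical: the text states only that the fourth phoneme has one score and the fifth phoneme two scores below the minimum score threshold, so the values here are chosen merely to reproduce those counts. The function and parameter names are likewise assumptions.

```python
def count_below(scores, min_score=0.30):
    """Number of scores that fall below the minimum score threshold."""
    return sum(1 for s in scores if s < min_score)

def phoneme_fails(scores, min_score=0.30, failing_count=1):
    """True when at least failing_count scores fall below min_score."""
    return count_below(scores, min_score) >= failing_count

# Hypothetical scores, not taken from FIG. 4.
p4_scores = [0.95, 0.85, 0.24]   # one score below 0.30
p5_scores = [0.25, 0.20, 0.90]   # two scores below 0.30

for limit in (1, 2):
    print(limit,
          phoneme_fails(p4_scores, failing_count=limit),
          phoneme_fails(p5_scores, failing_count=limit))
```

With `failing_count=1` both phonemes fail the criteria; with `failing_count=2` only the fifth phoneme does, which matches the two cases described above.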

The approach for determining the accuracy of a particular 
expected pronunciation from a pronunciation dictionary in a 
speech recognition system is now described with reference 
to both table 400 of FIG. 4 and a flow diagram 500 of FIG. 
5. After starting in step 502, in step 504, a particular 
expected pronunciation representation is retrieved from a 
pronunciation dictionary, for example, pronunciation dictio- 
nary 112 of FIG. 1. Expected representations contained in 
pronunciation dictionary 112 may be selectively retrieved 
and evaluated or systematically retrieved and evaluated as 
part of a regular pronunciation dictionary "tuning" proce- 
dure. 

In step 508, the particular expected pronunciation repre- 
sentation is compared to the one or more actual pronuncia- 
tions and accuracy scores are determined for the particular 
expected pronunciation. For example, as indicated by table 
400, the phonemes in the particular pronunciation represen- 
tation are evaluated against three actual pronunciations. An 
accuracy score is determined for each phoneme with respect 
to each actual pronunciation based upon how well the actual 
pronunciations correlate to the phonemes. For example, an 
average accuracy score of (0.79) for the first phoneme with 
respect to the three actual pronunciations is stored in column 
414. 

In step 510, the accuracy scores are evaluated against
specified accuracy criteria. For example, the average
accuracy score for the first phoneme of (0.79) is compared to the
minimum average score threshold of (0.50) in column 416.
In addition, the accuracy scores for the first phoneme with
respect to the three actual pronunciations of (0.90), (0.80)
and (0.67), respectively, are compared to the minimum score
threshold of (0.30) from column 418.
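
As a worked check of the figures quoted for the first phoneme, the short sketch below recomputes the average from the three scores and applies both thresholds. Only the numeric values come from the text; the variable names are illustrative assumptions.

```python
from statistics import mean

p1_scores = [0.90, 0.80, 0.67]   # scores for P1 against three actual pronunciations
MIN_AVERAGE = 0.50               # example minimum average score threshold
MIN_SCORE = 0.30                 # example minimum score threshold

average = round(mean(p1_scores), 2)          # (0.90 + 0.80 + 0.67) / 3
below = [s for s in p1_scores if s < MIN_SCORE]

print(average)                   # 0.79
print(average >= MIN_AVERAGE)    # True
print(len(below))                # 0
```

The first phoneme therefore satisfies both example thresholds.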

In step 512, a determination is made whether the average 
score satisfies the minimum average score threshold. If not, 
then the particular expected pronunciation representation 
does not satisfy the specified accuracy criteria and in step 
514, the particular expected representation is updated. For 
example, the average accuracy score of (0.47) for the second 
phoneme P2, as represented by entry 404, is below the 
average score threshold of (0.50). 

If in step 512, the average score does satisfy the minimum 
average score threshold, then in step 516, a determination is 
made whether the accuracy scores for the particular 
expected pronunciation representation satisfy the minimum 
score threshold. As previously described, if N number of the
accuracy scores for the particular expected pronunciation
representation fall below the minimum score threshold, then
the specified accuracy criteria are not satisfied. If not, then
control proceeds to step 514 where the particular expected
representation is updated to reflect the actual pronunciations
of the corresponding word. For example, the fourth phoneme
has one accuracy score below the minimum score threshold
while the fifth phoneme has two accuracy scores below the
minimum score threshold. If, however, in step 516, the
accuracy scores satisfy the minimum score threshold, then the
specified accuracy criteria are satisfied. The process is then
complete in step 518.

Although embodiments have been primarily described herein
in the context of determining the accuracy of expected
pronunciations of words, the approach described herein may
be used with any type of utterance and the invention is not
limited to the context of words.

5. Implementation Mechanisms

A. OVERVIEW

The approach described herein for automatically determining
the accuracy of a pronunciation dictionary in a speech
recognition system may be implemented in computer
software, in hardware circuitry, or as a combination of
computer software and hardware circuitry. Accordingly, the
invention is not limited to a particular computer software or
hardware circuitry implementation. For example, as
illustrated in FIG. 1, the approach may be implemented in
pronunciation diagnostic tool 114 as part of SRS 104. As
another example, the approach may be implemented as part
of recognizer 108. The approach may also be implemented
as a stand-alone mechanism located external to SRS 104 that
is periodically used to assess the accuracy of pronunciation
dictionary 112 and provide recommendations for expected
pronunciation representations in pronunciation dictionary
112 that do not satisfy the specified accuracy criteria.

B. IMPLEMENTATION HARDWARE

FIG. 6 is a block diagram that illustrates an example
computer system 600 upon which an embodiment of the
invention may be implemented. Computer system 600
includes a bus 602 or other communication mechanism for
communicating information, and a processor 604 coupled
with bus 602 for processing information. Computer system
600 also includes a main memory 606, such as a random
access memory (RAM) or other dynamic storage device,
coupled to bus 602 for storing information and instructions
to be executed by processor 604. Main memory 606 also
may be used for storing temporary variables or other
intermediate information during execution of instructions to be
executed by processor 604. Computer system 600 further
includes a read only memory (ROM) 608 or other static
storage device coupled to bus 602 for storing static
information and instructions for processor 604. A storage device
610, such as a magnetic disk or optical disk, is provided and
coupled to bus 602 for storing information and instructions.

Computer system 600 may be coupled by bus 602 to a
display 612, such as a cathode ray tube (CRT), for displaying
information to a computer user. An input device 614,
including alphanumeric and other keys, is coupled to bus 602 for
communicating information and command selections to
processor 604. Another type of user input device is cursor
control 616, such as a mouse, a trackball, or cursor direction
keys for communicating direction information and
command selections to processor 604 and for controlling cursor
movement on display 612. This input device typically has
two degrees of freedom in two axes, a first axis (e.g., x) and
a second axis (e.g., y), that allows the device to specify
positions in a plane.

The invention is related to the use of computer system 600
for automatically determining the accuracy of a
pronunciation dictionary. According to one embodiment of the
invention, a mechanism for automatically determining the
accuracy of a pronunciation dictionary is provided by
computer system 600 in response to processor 604 executing one
or more sequences of one or more instructions contained in
main memory 606. Such instructions may be read into main
memory 606 from another computer-readable medium, such
as storage device 610. Execution of the sequences of
instructions contained in main memory 606 causes processor 604
to perform the process steps described herein. One or more
processors in a multi-processing arrangement may also be
employed to execute the sequences of instructions contained
in main memory 606. In alternative embodiments,
hard-wired circuitry may be used in place of or in combination
with software instructions to implement the invention. Thus,
embodiments of the invention are not limited to any specific
combination of hardware circuitry and software.

The term "computer-readable medium" as used herein
refers to any medium that participates in providing
instructions to processor 604 for execution. Such a medium may
take many forms, including but not limited to, non-volatile
media, volatile media, and transmission media. Non-volatile
media includes, for example, optical or magnetic disks, such
as storage device 610. Volatile media includes dynamic
memory, such as main memory 606. Transmission media
includes coaxial cables, copper wire and fiber optics,
including the wires that comprise bus 602. Transmission media can
also take the form of acoustic or light waves, such as those
generated during radio wave and infrared data
communications.

Common forms of computer-readable media include, for
example, a floppy disk, a flexible disk, hard disk, magnetic
tape, or any other magnetic medium, a CD-ROM, any other
optical medium, punch cards, paper tape, any other physical
medium with patterns of holes, a RAM, a PROM, and
EPROM, a FLASH-EPROM, any other memory chip or
cartridge, a carrier wave as described hereinafter, or any
other medium from which a computer can read.

Various forms of computer readable media may be
involved in carrying one or more sequences of one or more
instructions to processor 604 for execution. For example, the
instructions may initially be carried on a magnetic disk of a
remote computer. The remote computer can load the
instructions into its dynamic memory and send the instructions over
a telephone line using a modem. A modem local to computer
system 600 can receive the data on the telephone line and
use an infrared transmitter to convert the data to an infrared
signal. An infrared detector coupled to bus 602 can receive
the data carried in the infrared signal and place the data on
bus 602. Bus 602 carries the data to main memory 606, from
which processor 604 retrieves and executes the instructions.
The instructions received by main memory 606 may
optionally be stored on storage device 610 either before or after
execution by processor 604.

Computer system 600 also includes a communication
interface 618 coupled to bus 602. Communication interface
618 provides a two-way data communication coupling to a
network link 620 that is connected to a local network 622.
For example, communication interface 618 may be an
integrated services digital network (ISDN) card or a modem
to provide a data communication connection to a
corresponding type of telephone line. As another example,
communication interface 618 may be a local area network
(LAN) card to provide a data communication connection to
a compatible LAN. Wireless links may also be implemented.
In any such implementation, communication interface 618
sends and receives electrical, electromagnetic or optical
signals that carry digital data streams representing various
types of information.

Network link 620 typically provides data communication 
through one or more networks to other data devices. For 
example, network link 620 may provide a connection 
through local network 622 to a host computer 624 or to data 
equipment operated by an Internet Service Provider (ISP) 
626. ISP 626 in turn provides data communication services 
through the world wide packet data communication network 
now commonly referred to as the "Internet" 628. Local 
network 622 and Internet 628 both use electrical, electro- 
magnetic or optical signals that carry digital data streams. 
The signals through the various networks and the signals on 
network link 620 and through communication interface 618, 
which carry the digital data to and from computer system 
600, are exemplary forms of carrier waves transporting the 
information. 

Computer system 600 can send messages and receive 
data, including program code, through the network(s), net- 
work link 620 and communication interface 618. In the 
Internet example, a server 630 might transmit a requested 
code for an application program through Internet 628, ISP 
626, local network 622 and communication interface 618. In 
accordance with the invention, one such downloaded appli- 
cation provides for automatically determining the accuracy 
of a pronunciation dictionary as described herein. 

The received code may be executed by processor 604 as 
it is received, and/or stored in storage device 610, or other 
non-volatile storage for later execution. In this manner, 
computer system 600 may obtain application code in the 
form of a carrier wave. 

The approach described in this document for automatically
determining the accuracy of a pronunciation dictionary
provides several benefits and advantages over prior
approaches. In particular, the use of an automated
mechanism reduces the amount of human resources required to
determine the accuracy of a pronunciation dictionary. This 
allows the accuracy of a pronunciation dictionary to be 
periodically assessed and corrected without having to wait 
for users to identify problems with particular words. 
Moreover, the automated approach allows a pronunciation 
dictionary to be more quickly updated to reflect changes to 
an application, users or context than prior manual 
approaches. The automated nature of the approach may also 
increase the accuracy of pronunciation dictionary 112 since: 
(1) the approach can account for properties of speech 
recognition system 104; and (2) manually-adjusted pronun- 
ciations can be less accurate because of biased linguistic 
preconceptions. 

In the foregoing specification, particular embodiments 
have been described. It will, however, be evident that 
various modifications and changes may be made thereto 
without departing from the broader spirit and scope of the 
invention. The specification and drawings are, accordingly, 
to be regarded in an illustrative rather than a restrictive 
sense. 

What is claimed is: 

1. A method of determining the accuracy of a pronuncia- 
tion dictionary so that the dictionary may be updated to 
improve its accuracy, comprising: 

providing a pronunciation dictionary having a plurality of 
entries, wherein each entry includes a word identifier 
and at least one phoneme string of an expected pro- 
nunciation of a word, each phoneme string having a 
plurality of phonemes; 

receiving a plurality of actual utterances of a specific
word from a plurality of users;

comparing each of the utterances to a phoneme string in
the dictionary to generate a corresponding phoneme
string score, wherein each phoneme string score indicates
on a phoneme-by-phoneme basis the accuracy of
the received utterance relative to the compared phoneme
string;

evaluating the phoneme string scores to predetermined 
accuracy criteria to identify entries in the dictionary 
that should be updated. 
2. The method of claim 1 wherein the phoneme string
score has a phoneme score for each phoneme in the phoneme 
string, each phoneme score being indicative of the correla- 
tion between a phoneme in the phoneme string and a 
corresponding phoneme in the actual utterance. 
3. The method of claim 2 wherein the method further
comprises 

computing, for each phoneme in the phoneme string, an 
average phoneme score from the corresponding pho- 
neme scores of each of the actual utterances; 

determining if any of the average phoneme scores is
below a threshold value;
if so, identifying the corresponding entry in the dictionary 
that has the phoneme string as needing updating. 
4. The method of claim 2 wherein the method further
comprises comparing the phoneme scores to a minimum 
score threshold and identifying the corresponding entry in 
the dictionary that has the phoneme string as needing 
updating if at least one of the phonemes in the string has a 
specified number of instances in which the phoneme score is
below the minimum score threshold. 

5. A computer readable medium carrying one or more 
sequences of instructions for determining the accuracy of a 
pronunciation dictionary so that the dictionary may be 
updated to improve its accuracy, the one or more sequences
of instructions including instructions which, when executed
by one or more processors, perform the steps of:
providing a pronunciation dictionary having a plurality of 
entries, wherein each entry includes a word identifier 
and at least one phoneme string of an expected
pronunciation of a word, each phoneme string having a
plurality of phonemes; 
receiving a plurality of actual utterances of a specific 
word from a plurality of users; 
comparing each of the utterances to a phoneme string in
the dictionary to generate a corresponding phoneme 
string score, wherein each phoneme string score indi- 
cates on a phoneme-by-phoneme basis the accuracy of 
the received utterance relative to the compared phoneme
string;

evaluating the phoneme string scores to predetermined 
accuracy criteria to identify entries in the dictionary 
that should be updated. 
6. The computer readable medium of claim 5 wherein the
phoneme string score has a phoneme score for each pho- 
neme in the phoneme string, each phoneme score being 
indicative of the correlation between a phoneme in the 
phoneme string and a corresponding phoneme in the actual 
utterance.

7. The computer readable medium of claim 6 wherein the 
instructions further perform the steps of 
computing, for each phoneme in the phoneme string, an 
average phoneme score from the corresponding phoneme
scores of each of the actual utterances;

determining if any of the average phoneme scores is
below a threshold value;

if so, identifying the corresponding entry in the dictionary
that has the phoneme string as needing updating.

8. The computer readable medium of claim 6 wherein the
instructions further perform the steps of

comparing the phoneme scores to a minimum score
threshold; and

identifying the corresponding entry in the dictionary that
has the phoneme string as needing updating if at least
one of the phonemes in the string has a specified
number of instances in which the phoneme score is
below the minimum score threshold.

9. A speech recognition diagnostic tool to determine the
accuracy of a pronunciation dictionary so that the dictionary
may be updated to improve its accuracy, comprising:

a pronunciation dictionary having a plurality of entries,
wherein each entry includes a word identifier and at
least one phoneme string of an expected pronunciation
of a word, each phoneme string having a plurality of
phonemes;

logic to receive a plurality of actual utterances of a
specific word from a plurality of users;

logic to compare each of the utterances to a phoneme
string in the dictionary to generate a corresponding
phoneme string score, wherein each phoneme string
score indicates on a phoneme-by-phoneme basis the
accuracy of the received utterance relative to the
compared phoneme string;

logic to evaluate the phoneme string scores to
predetermined accuracy criteria to identify entries in the
dictionary that should be updated.

10. The speech recognition diagnostic tool of claim 9
wherein the logic to compare includes logic to generate a
phoneme string score having a phoneme score for each
phoneme in the phoneme string, each phoneme score being
indicative of the correlation between a phoneme in the
phoneme string and a corresponding phoneme in the actual
utterance.

11. The speech recognition diagnostic tool of claim 10
further comprising

logic to compute, for each phoneme in the phoneme
string, an average phoneme score from the corresponding
phoneme scores of each of the actual utterances;

logic to determine if any of the average phoneme scores
is below a threshold value and, if so, to identify the
corresponding entry in the dictionary that has the
phoneme string as needing updating.

12. The speech recognition diagnostic tool of claim 10
further comprising

logic to compare the phoneme scores to a minimum score
threshold and to identify the corresponding entry in the
dictionary that has the phoneme string as needing
updating if at least one of the phonemes in the string
has a specified number of instances in which the
phoneme score is below the minimum score threshold.

* * * * *



